Introduction

The Dataset:

The dataset consists of Rapsodo Baseball data collected during the Fall 2023 baseball season at the University of Hawai’i at Hilo. The players were informed that their data would be used for data manipulation and visualization, and that they would be credited for their contributions.

Why UH Hilo Pitching data:

The objective of using this dataset is to better understand player performance, focusing on metrics such as pitch velocity, spin rate, and BMI, along with other key factors that affect pitcher performance.

This analysis aims to identify areas of strength and opportunities for improvement, contributing to the rapidly growing use of data-based analytics at the collegiate level of baseball. The aim is to use the data to complement the adjustments that our UH Hilo pitching coach, Kallen Miyatki, incorporates into our daily drills, bullpens, breakdown, arm care, and recovery programs.

The insights gained from this study are expected not only to improve the University of Hawai’i at Hilo pitching staff but also to benefit the future of data and sports analytics.

Part I

Descriptive Statistics:

Descriptive statistics were run on the fastball and changeup velocity readings, pooled across players. They include the mean, median, mode, IQR, standard deviation, and standard error, along with the Shapiro-Wilk test for normality. The Shapiro-Wilk test examines how closely the data fit a normal distribution; a small p-value (conventionally below 0.05) is evidence against normality. Knowing whether velocity is approximately normally distributed helps pitching analysts judge which factors are associated with peak performance and, conversely, which factors cause the data to deviate from a normal distribution.
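
The summary statistics listed above can be collected in one small helper. This is a minimal sketch, not the code used in this report: the function name `describe_velocity` and the toy vector are hypothetical, and the standard error is computed as the standard deviation divided by the square root of the number of non-missing readings.

```r
# Hypothetical helper: summary statistics for a numeric vector of velocities.
describe_velocity <- function(x) {
  x <- x[!is.na(x)]                      # drop missing readings first
  c(mean   = mean(x),
    median = median(x),
    sd     = sd(x),
    se     = sd(x) / sqrt(length(x)),    # standard error of the mean
    iqr    = IQR(x))
}

# Made-up toy readings, not taken from the dataset.
toy_velo <- c(80, 82, 84, NA)
velo_stats <- describe_velocity(toy_velo)
print(velo_stats)
```

Returning a named vector keeps the statistics together, so the same call can be reused for fastball and changeup readings.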

Additionally, I ran a linear regression of fastball velocity on each player's height and weight to determine whether there is a linear relationship between those two factors and fastball velocity. It is a common saying that for a pitcher to gain velocity, they must gain weight. If a linear model fits the data, it will help players identify a target weight for their height in order to gain velocity.
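
To illustrate how such a model could be read by a player, the sketch below plugs a height and weight into the coefficient estimates reported in the regression summary in this section. The helper name `predict_fb_velo` and the example player (72 in, 185 lb) are hypothetical, and the units are assumed to be inches and pounds, as the BMI calculation in Part III suggests. The reported R-squared of 0.2317 means the model explains only about 23% of the variation in velocity, so the result is a rough guide at best.

```r
# Coefficient estimates copied from the lm() summary reported in this section.
coefs <- c(intercept = 37.34325, height = 0.25408, weight = 0.14235)

# Hypothetical helper: fitted fastball velocity for a height (in) and weight (lb).
predict_fb_velo <- function(height_in, weight_lb) {
  unname(coefs["intercept"] +
           coefs["height"] * height_in +
           coefs["weight"] * weight_lb)
}

# Illustrative player, not a member of the dataset.
example_velo <- predict_fb_velo(72, 185)
print(example_velo)

# Fitted change in velocity for a 10 lb weight gain at a fixed height.
gain_per_10lb <- predict_fb_velo(72, 195) - predict_fb_velo(72, 185)
print(gain_per_10lb)
```

With the fitted `lmPitchingVelo` object available, `predict(lmPitchingVelo, newdata = data.frame(Height = 72, Weight = 185))` should give the same kind of estimate without copying coefficients by hand.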

I also performed a Mann-Whitney test to compare left-handed and right-handed pitchers. The dependent variable was whether the pitch was a strike. To avoid biasing the comparison, I included every pitch in the dataset. The value of the Mann-Whitney test is that its feedback can be used on the field to break down which pitchers need to focus on throwing more strikes.
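
Because the outcome here is binary (strike or not), the comparison can also be framed as a difference in strike rates. The sketch below runs both the Mann-Whitney test and a two-sample proportion test (`prop.test`, a different technique from the one used in this report) on simulated data. The sample sizes and strike probabilities are made up for illustration; only the column names `Handedness` and `Is.Strike` mirror the dataset.

```r
set.seed(1)

# Simulated, hypothetical pitches: 1 = strike, 0 = not a strike.
toy <- data.frame(
  Handedness = rep(c("R", "L"), times = c(120, 80)),
  Is.Strike  = c(rbinom(120, 1, 0.60), rbinom(80, 1, 0.55))
)

# Mann-Whitney (Wilcoxon rank sum) test, as used in this report.
mw <- wilcox.test(Is.Strike ~ Handedness, data = toy)

# Alternative framing: compare the strike proportions of the two groups.
strikes   <- as.vector(tapply(toy$Is.Strike, toy$Handedness, sum))
n_pitches <- as.vector(tapply(toy$Is.Strike, toy$Handedness, length))
pt <- prop.test(x = strikes, n = n_pitches)

print(mw$p.value)
print(pt$p.value)
```

The proportion test also reports the estimated strike rate for each group, which may be easier to relay to players than a rank-sum statistic.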

#install.packages("dplyr")
library(dplyr)
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.3     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
#header = TRUE to specify that headers are as is
fall23bbSet = read.csv("fall2023UHHilo.csv", header = TRUE, sep = ",")
fall23bb = na.omit(fall23bbSet)

#filtering for fastball velocity
fbVelo = as.numeric(fall23bb[fall23bb$Pitch.Type == "Fastball", "Velocity"])

#filtering for changeup velocity
chVelo = as.numeric(fall23bb[fall23bb$Pitch.Type == "ChangeUp", "Velocity"])

#descriptive stats of fastball and changeup velocity
#mean
meanFbVelo = mean(fbVelo, na.rm = TRUE)
print(meanFbVelo)
## [1] 82.78693
meanChVelo = mean(chVelo, na.rm = TRUE)
print(meanChVelo)
## [1] 72.43179
#median
medianFbVelo = median(fbVelo, na.rm = TRUE)
print(medianFbVelo)
## [1] 82.68
medianChVelo = median(chVelo, na.rm = TRUE)
print(medianChVelo)
## [1] 72.88
#mode
#defining mode function
Mode = function(x) {
  ux = unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
modeFbVelo = Mode(fbVelo)
print(modeFbVelo)
## [1] 84.57
modeChVelo = Mode(chVelo)
print(modeChVelo)
## [1] 73.85
#IQR
iqrFbVelo = IQR(fbVelo, na.rm = TRUE)
print(iqrFbVelo)
## [1] 7.54
iqrChVelo = IQR(chVelo, na.rm = TRUE)
print(iqrChVelo)
## [1] 7.82
#standard error
seFbVelo = sd(fbVelo, na.rm = TRUE) / sqrt(sum(!is.na(fbVelo)))
print(seFbVelo)
## [1] 0.2899949
seChVelo = sd(chVelo, na.rm = TRUE) / sqrt(sum(!is.na(chVelo)))
print(seChVelo)
## [1] 0.4745072
#standard deviation
sdFbVelo = sd(fbVelo, na.rm = TRUE)
print(sdFbVelo)
## [1] 4.963912
sdChVelo = sd(chVelo, na.rm = TRUE)
print(sdChVelo)
## [1] 4.624924
#Shapiro-Wilk test
shapiroFbVelo = shapiro.test(fbVelo)
print(shapiroFbVelo)
## 
##  Shapiro-Wilk normality test
## 
## data:  fbVelo
## W = 0.97486, p-value = 5.096e-05
shapiroChVelo = shapiro.test(chVelo)
print(shapiroChVelo)
## 
##  Shapiro-Wilk normality test
## 
## data:  chVelo
## W = 0.97844, p-value = 0.1187
#linear regression analysis of the effect of height and weight on velocity
#subset fastball velocity for multiple regression
fbVeloData = fall23bb[fall23bb$Pitch.Type == "Fastball", c("Velocity", "Height", "Weight")]

lmPitchingVelo = lm(Velocity ~ Height + Weight, data = fbVeloData)
summary(lmPitchingVelo)
## 
## Call:
## lm(formula = Velocity ~ Height + Weight, data = fbVeloData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.6195  -2.8596   0.1329   3.0813   9.0146 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 37.34325    5.21918   7.155 6.87e-12 ***
## Height       0.25408    0.07280   3.490 0.000557 ***
## Weight       0.14235    0.02232   6.377 7.10e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.366 on 290 degrees of freedom
## Multiple R-squared:  0.2317, Adjusted R-squared:  0.2264 
## F-statistic: 43.72 on 2 and 290 DF,  p-value: < 2.2e-16
#Mann-Whitney test
pitcherHandedness = factor(fall23bb$Handedness, levels = c("R", "L"))
#recode Is.Strike as 1 (strike) or 0 (not a strike)
fall23bb$Is.Strike = as.numeric(fall23bb$Is.Strike == "YES")
mannWhitneyStrike = wilcox.test(Is.Strike ~ Handedness, data = fall23bb)
print(mannWhitneyStrike)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  Is.Strike by Handedness
## W = 45214, p-value = 0.6678
## alternative hypothesis: true location shift is not equal to 0

Part II

Data Visualization:

The plots used for data visualization include a linear regression scatter plot, a box plot, and a 3D scatter plot.

The linear regression scatter plot was used to visualize the regression analysis of height and weight and their respective effects on fastball velocity. In the plot, height is on the x-axis, weight is mapped to point color, and the regression line is fit on height. A linear regression plot was chosen because it is a clean and precise way to visualize the data points, especially with the addition of the regression line.

The box plot is used alongside the Mann-Whitney test results. It displays fastball and changeup velocities on the same figure so that the two pitch types can be compared more easily.

The 3D scatter plot was implemented to display the effect that total spin has on pitch velocity. I placed player name on one axis to show the players where their velocity stacks up among members of the pitching staff. Additionally, I included a median fastball velocity marker, computed in the code as the median of the pitchers' top fastball velocities, so that each pitcher gets feedback on their difference from the median. This is valuable because the player is informed about the difference between their top velocity and the median, which is essential for implementing pitch design as well as understanding their role on the pitching staff.
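
A gap between top and median fastball velocity can also be tabulated per pitcher. The sketch below is a minimal illustration using a hypothetical toy data frame: the players "A" and "B" and their velocities are made up, and only the column names `Player.Name` and `Velocity` mirror the dataset. It uses each pitcher's own median, which is an assumption about what feedback is wanted rather than what the plot code computes.

```r
# Hypothetical toy fastball readings for two made-up pitchers.
pitches <- data.frame(
  Player.Name = c("A", "A", "A", "B", "B", "B"),
  Velocity    = c(80, 82, 85, 76, 78, 79)
)

# Per-pitcher top and median fastball velocity.
top_velo    <- tapply(pitches$Velocity, pitches$Player.Name, max)
median_velo <- tapply(pitches$Velocity, pitches$Player.Name, median)

# One row per pitcher, with the gap between top and median velocity.
velo_gap <- data.frame(
  Player.Name = names(top_velo),
  topVelo     = as.numeric(top_velo),
  medianVelo  = as.numeric(median_velo),
  gap         = as.numeric(top_velo - median_velo)
)
print(velo_gap)
```

A large gap suggests a pitcher rarely reaches their top velocity, while a small gap suggests they sit close to it.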

#install.packages("plotly")
library(plotly)
#install.packages("ggplot2")
library(ggplot2)

#linear regression analysis height and weight plot
ggplot(fbVeloData, aes(x = Height, y = Velocity, color = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", aes(group = 1), color = "black") +
  theme_minimal() +
  labs(title = "Linear Regression Analysis of Pitching Velocity")
## `geom_smooth()` using formula = 'y ~ x'

#box plot comparing fastball and changeup velocities
combinedPitchVelo = fall23bb[fall23bb$Pitch.Type %in% c("Fastball", "ChangeUp"), c("Velocity", "Pitch.Type")]

boxplot(Velocity ~ Pitch.Type, data = combinedPitchVelo,
        main = "Comparison of Pitch Velocities",
        xlab = "Pitch Type",
        ylab = "Velocity",
        horizontal = TRUE,
        col = c("salmon", "lightblue"))

#3D scatter plot of each player's top fastball velocity, with the median of those top velocities
#Total.Spin is converted with as.numeric() so it plots on a numeric axis (it appears to be read in as text)
topPlayerVelo = fall23bb %>%
  filter(Pitch.Type == "Fastball") %>%
  group_by(Player.Name) %>%
  summarise(topVeloFb = max(Velocity), Total.Spin = as.numeric(Total.Spin[which.max(Velocity)]))

medianTopVelo = median(topPlayerVelo$topVeloFb, na.rm = TRUE)

plot_ly(data = topPlayerVelo, x = ~Player.Name, y = ~Total.Spin, z = ~topVeloFb,
                       type = "scatter3d", mode = "markers",
                       marker = list(size = 4, colorscale = 'Viridis'), name = "Top Velocity") %>%
  add_markers(data = topPlayerVelo,
              x = ~Player.Name, y = ~Total.Spin, z = rep(medianTopVelo, nrow(topPlayerVelo)),
              marker = list(color = 'red', size = 2, symbol = 'square'), name = "Median Velocity") %>%
  layout(
    scene = list(
      xaxis = list(title = 'Player Name'),
      yaxis = list(title = 'Total Spin'),
      zaxis = list(title = 'Top Velocity (Fastball)')),
    title = "3D Scatter Plot of Player Name, Total Spin,<br>and Top Velocity (Fastball) with Median Velocity per Pitcher")

Part III

Data Management & Visualization:

I created a data frame and calculated the body mass index (BMI) for each player on the UH Hilo baseball team. The data was then separated into pitchers who are taller than 72 in (6 ft) and those who are 72 in or shorter. The data was used to create a scatter plot with a linear regression line for each height group, displaying the relationship between BMI and fastball velocity. This was followed by a t-test comparing mean fastball velocity between the two height groups.
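
The BMI calculation and the height split can be written as two small functions. This is a sketch under the assumption that weight is recorded in pounds and height in inches, which is what the conversion factor of 703 used in this report implies. The function names are hypothetical.

```r
# BMI from imperial units: weight in pounds, height in inches.
bmi_imperial <- function(weight_lb, height_in) {
  (weight_lb / height_in^2) * 703
}

# Height grouping used for the comparison: strictly over 72 in versus 72 in or below.
height_group <- function(height_in) {
  ifelse(height_in > 72, "Over 72", "72 or below")
}

# Example: a 165 lb, 70 in pitcher.
example_bmi <- bmi_imperial(165, 70)
print(example_bmi)
print(height_group(c(70, 72, 74.4)))
```

Note that a pitcher who is exactly 72 in tall falls in the "72 or below" group.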

Using this data frame, I then created a heat map to display fastball velocity in relation to each player's BMI. The heat map, drawn as a color-scaled scatter plot, is a different way to display the relationship between fastball velocity and BMI. I used height and weight as the axis variables, velocity as the color gradient, and BMI as the dot size. The heat map combines these factors into a single visualization that can be more easily interpreted.

I then subset the data set to extract only fastball data and the corresponding fields: player name, total spin, handedness, and velocity. Next, I handled missing (NA) values by filtering out the affected rows. Finally, I created another linear regression scatter plot to display the relationship between total spin and pitch velocity for fastballs, grouped by handedness.
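
When a numeric column is read in as text, missing readings may not register as NA until the column is converted. The sketch below shows one way to handle that: convert first, then filter. The toy data frame is hypothetical, and the "-" placeholder for a missing total spin reading is an assumption for illustration.

```r
# Hypothetical toy rows; Total.Spin is text and uses "-" for a missing reading.
toy <- data.frame(
  Velocity   = c(78.73, 82.77, 80.10),
  Total.Spin = c("2172.3", "-", "1941"),
  stringsAsFactors = FALSE
)

# Convert to numeric first; entries that are not numbers become NA.
# suppressWarnings() hides the expected coercion warning.
toy$Total.Spin <- suppressWarnings(as.numeric(toy$Total.Spin))

# Then drop rows with a missing velocity or total spin.
clean <- toy[!is.na(toy$Velocity) & !is.na(toy$Total.Spin), ]
print(clean)
```

Filtering before converting would keep the "-" row, because a text placeholder is not itself NA.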

#BMI data frame
bmiData = fall23bb %>%
  group_by(Player.Name) %>%
  summarise(
    Weight = first(Weight),
    Height = first(Height),
    BMI = (first(Weight) / (first(Height)^2)) * 703)
print(bmiData)
## # A tibble: 20 × 4
##    Player.Name                           Weight Height   BMI
##    <chr>                                  <int>  <dbl> <dbl>
##  1 "Aaron Davies"                           175   72    23.7
##  2 "Braden Lowe"                            195   73    25.7
##  3 "Christian Wood"                         189   74.4  24.0
##  4 "Connor Dougal"                          215   78    24.8
##  5 "Devin Hayashi"                          150   66    24.2
##  6 "Dylan Montague"                         175   77    20.7
##  7 "Ethan Salscheider"                      190   70    27.3
##  8 "Haokeakumehokealani Kekahuna-Tomita"    175   68    26.6
##  9 "Jake Liberta"                           195   72    26.4
## 10 "James Yamasaki"                         190   72    25.8
## 11 "Matthew O'Brien"                        185   72    25.1
## 12 "Nick Agacki"                            195   81.6  20.6
## 13 "Orlando  Leon Jr "                      205   72    27.8
## 14 "Santiago Velarde"                       229   77    27.2
## 15 "Sebastian Garcia"                       188   74    24.1
## 16 "Stephen Perry"                          185   73    24.4
## 17 "Troy frazier"                           160   66    25.8
## 18 "Ty Honda"                               165   70    23.7
## 19 "devin meyer"                            185   69    27.3
## 20 "luke Dickson"                           190   75    23.7
fastballData = fall23bb %>%
  filter(Pitch.Type == "Fastball")

fastballBmiData = fastballData %>%
  left_join(bmiData, by = "Player.Name")
print(head(fastballBmiData, n = 10))
##    Handedness Player.ID    Player.Name                       Date   Pitch.ID
## 1           R    887109 Christian Wood Sat Oct 07 2023 2:34:33 AM 1696646073
## 2           R    887109 Christian Wood Sat Oct 07 2023 2:35:25 AM 1696646125
## 3           R    887109 Christian Wood Sat Oct 07 2023 2:36:31 AM 1696646191
## 4           R    887109 Christian Wood Sat Oct 07 2023 2:36:48 AM 1696646208
## 5           R    887109 Christian Wood Sat Oct 07 2023 2:38:01 AM 1696646281
## 6           R    887109 Christian Wood Sat Oct 07 2023 2:39:19 AM 1696646359
## 7           R    909545       Ty Honda Mon Oct 02 2023 1:26:11 AM 1696209971
## 8           R    909545       Ty Honda Mon Oct 02 2023 1:26:29 AM 1696209989
## 9           R    909545       Ty Honda Mon Oct 02 2023 1:28:21 AM 1696210101
## 10          R    909545       Ty Honda Mon Oct 02 2023 1:29:20 AM 1696210160
##    Pitch.Type Is.Strike Strike.Zone.Side Strike.Zone.Height Velocity Total.Spin
## 1    Fastball         1             3.01              27.17    78.73     2172.3
## 2    Fastball         0            17.09              40.09    78.64     1936.1
## 3    Fastball         0             5.66              45.07    78.27     1828.2
## 4    Fastball         1             5.27              37.48    78.38     2229.8
## 5    Fastball         1             3.58              29.52    79.96       1941
## 6    Fastball         1             8.89              40.16    78.07     2164.2
## 7    Fastball         1            -5.37              27.89    78.33     2186.5
## 8    Fastball         0            -8.52              10.21    78.75     2129.6
## 9    Fastball         1            -2.53              34.21    78.47     2103.6
## 10   Fastball         1            -1.11              34.57    82.77     2213.2
##    True.Spin..release. Spin.Efficiency..release. Spin.Direction Spin.Confidence
## 1               2172.2                       100           2:00             0.7
## 2               1935.4                       100           2:00             0.8
## 3               1825.8                      99.9           1:54             0.8
## 4               2227.7                      99.9           2:06             0.7
## 5               1936.3                      99.8           2:06             0.9
## 6               2162.4                      99.9           2:02             0.8
## 7               2068.6                      94.6           0:50             0.8
## 8               2091.4                      98.2           0:38             0.8
## 9               2067.3                      98.3           0:44             0.8
## 10              2045.7                      92.4           0:50             0.8
##    Release.Extension..ft. VB..trajectory. HB..trajectory. VB..spin. HB..spin.
## 1                       -            10.7           17.32      10.7      18.9
## 2                       -           10.11           15.55      10.1      17.9
## 3                       -           10.89           13.63      10.9      16.9
## 4                       -            9.94           15.79       9.9      19.5
## 5                       -            9.37           18.68       9.4      18.2
## 6                       -           10.48           20.96      10.5        19
## 7                       -           18.65            4.72      18.6       8.8
## 8                       -           20.15            7.21      20.2       6.9
## 9                       -           19.55            8.54      19.6         8
## 10                      -           17.87            5.35      17.9       8.4
##    Horizontal.Angle Release.Angle Release.Height Release.Side Gyro.Degree..deg.
## 1             -3.93          -0.9           4.25         1.67             -0.35
## 2             -2.78          0.44           4.54         2.21             -1.55
## 3             -3.69          0.81           4.57         1.95              2.95
## 4             -3.75         -0.27           4.52         1.79              2.51
## 5             -4.32          0.59           4.32         2.35             -3.99
## 6             -4.13          1.12           4.45         2.29               2.3
## 7             -3.71          -1.6           4.73         1.93             18.89
## 8             -4.02         -3.15           4.81         2.08             10.86
## 9             -3.45         -1.47           4.95         2.06             10.65
## 10            -2.96         -1.31           4.88          1.9             22.44
##            Unique.ID Height.x Weight.x Weight.y Height.y      BMI
## 1  887109@1696646073     74.4      189      189     74.4 24.00332
## 2  887109@1696646125     74.4      189      189     74.4 24.00332
## 3  887109@1696646191     74.4      189      189     74.4 24.00332
## 4  887109@1696646208     74.4      189      189     74.4 24.00332
## 5  887109@1696646281     74.4      189      189     74.4 24.00332
## 6  887109@1696646359     74.4      189      189     74.4 24.00332
## 7  909545@1696209971     70.0      165      165     70.0 23.67245
## 8  909545@1696209989     70.0      165      165     70.0 23.67245
## 9  909545@1696210101     70.0      165      165     70.0 23.67245
## 10 909545@1696210160     70.0      165      165     70.0 23.67245
fastballBmiData$HeightCategory = ifelse(fastballBmiData$Height.y > 72, "Over 72", "72 or below")

ggplot(fastballBmiData, aes(x = BMI, y = Velocity, color = HeightCategory)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_color_manual(values = c("Over 72" = "blue", "72 or below" = "red")) +
  labs(title = "Scatter Plot of BMI vs Fastball Velocity",
       x = "BMI",
       y = "Fastball Velocity",
       color = "Height Category (in)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

#t-test comparing mean fastball velocity between the two height categories
tTest = t.test(Velocity ~ HeightCategory, data = fastballBmiData)
print(tTest)
## 
##  Welch Two Sample t-test
## 
## data:  Velocity by HeightCategory
## t = 2.0865, df = 284.25, p-value = 0.03783
## alternative hypothesis: true difference in means between group 72 or below and group Over 72 is not equal to 0
## 95 percent confidence interval:
##  0.06770996 2.32427587
## sample estimates:
## mean in group 72 or below     mean in group Over 72 
##                  83.36656                  82.17056
#heat map (color-scaled scatter plot) of fastball velocity by height and weight, with BMI as point size
ggplot(fastballBmiData, aes(x = Height.y, y = Weight.y, color = Velocity, size = BMI)) +
  geom_point(alpha = 0.7) +  # Semi-transparent points
  labs(x = "Height", y = "Weight", color = "Fastball Velocity", size = "BMI") +
  ggtitle("Scatterplot: Height vs Weight with Fastball Velocity and BMI") +
  scale_color_gradient(low = "lightblue", high = "red") +
  guides(size = guide_legend(title = "BMI")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#Scatter plot of total spin vs velocity with linear regression
#subsetting fb data
fbData23 = fall23bb %>%
  filter(Pitch.Type == "Fastball") %>%
  select(Player.Name, Total.Spin, Handedness, Velocity)

#handling NA's: convert to numeric first so non-numeric entries become NA, then drop those rows
fbData23$Velocity = as.numeric(fbData23$Velocity)
fbData23$Total.Spin = as.numeric(fbData23$Total.Spin)
fbData23 = fbData23 %>%
  filter(!is.na(Velocity) & !is.na(Total.Spin))
print(head(fbData23, n = 10))
##       Player.Name Total.Spin Handedness Velocity
## 1  Christian Wood     2172.3          R    78.73
## 2  Christian Wood     1936.1          R    78.64
## 3  Christian Wood     1828.2          R    78.27
## 4  Christian Wood     2229.8          R    78.38
## 5  Christian Wood     1941.0          R    79.96
## 6  Christian Wood     2164.2          R    78.07
## 7        Ty Honda     2186.5          R    78.33
## 8        Ty Honda     2129.6          R    78.75
## 9        Ty Honda     2103.6          R    78.47
## 10       Ty Honda     2213.2          R    82.77
ggplot(fbData23, aes(x = Velocity, y = Total.Spin, color = Handedness)) +
  geom_point() +
  geom_smooth(method = "lm", aes(fill = Handedness), alpha = 0.2, na.rm = TRUE) +
  labs(title = "Scatter plot of Total Spin vs Velocity with Linear Regression", x = "Velocity", y = "Total Spin") +
  scale_color_manual(values = c("L" = "skyblue", "R" = "pink")) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'